fix(docling)!: change default export type to MARKDOWN and add page_number to chunk metadata by SyedShahmeerAli12 · Pull Request #3276 · deepset-ai/haystack-core-integrations

SyedShahmeerAli12 · 2026-05-06T07:28:55Z

Related Issues

fixes Make DoclingConverter metadata consistent with other converters #3256

Proposed Changes:

ExportType.MARKDOWN is now the default export type (previously DOC_CHUNKS), aligning DoclingConverter with Haystack's convention of separating conversion from chunking. Users who want chunked output should pass export_type=ExportType.DOC_CHUNKS explicitly.
MetaExtractor.extract_chunk_meta() now extracts page_number from chunk provenance info, making chunk metadata consistent with other Haystack splitters like DocumentSplitter.
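
The page-number rule can be sketched as below. This is a minimal illustration, not the code in this PR: the `Prov` and `DocItem` dataclasses and the `page_number_from_items` helper are hypothetical stand-ins, and the choice of the lowest page for a chunk spanning several pages is inferred from the test name `test_extract_chunk_meta_page_number_uses_minimum`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Prov:
    """Hypothetical stand-in for a provenance entry that carries a page number."""
    page_no: int


@dataclass
class DocItem:
    """Hypothetical stand-in for one source item referenced by a chunk."""
    prov: list[Prov] = field(default_factory=list)


def page_number_from_items(doc_items: list[DocItem]) -> Optional[int]:
    """Return the lowest page number across all provenance entries, or None."""
    pages = [p.page_no for item in doc_items for p in item.prov]
    return min(pages) if pages else None


# A chunk whose items come from pages 3 and 2 is attributed to page 2.
items = [DocItem(prov=[Prov(page_no=3)]), DocItem(prov=[Prov(page_no=2)])]
meta = {"page_number": page_number_from_items(items)}
print(meta)
```

Returning `None` when no provenance is available is an assumption of this sketch; the thread does not say how the real extractor handles that case.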

How did you test it?

All 34 existing unit tests pass
Added 2 new unit tests: test_extract_chunk_meta_includes_page_number and test_extract_chunk_meta_page_number_uses_minimum

Notes for the reviewer

This is a breaking change: the default export_type has changed from DOC_CHUNKS to MARKDOWN. Existing pipelines that relied on the default without setting it explicitly will need to add export_type=ExportType.DOC_CHUNKS.
dl_meta is preserved in chunk metadata for backward compatibility alongside the new page_number field.
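
To illustrate why callers who relied on the old default need to act, here is a toy sketch. The `ExportType` enum and `convert` function are stand-ins written for this example, not the real `DoclingConverter`; the member names come from this PR, while the string values and the paragraph-based chunking are assumptions.

```python
from enum import Enum


class ExportType(str, Enum):
    """Stand-in enum: member names are from the PR, the values are assumed."""
    MARKDOWN = "markdown"
    DOC_CHUNKS = "doc_chunks"


def convert(text: str, export_type: ExportType = ExportType.MARKDOWN) -> list[dict]:
    """Toy converter: one document for MARKDOWN, one per paragraph for DOC_CHUNKS."""
    if export_type is ExportType.DOC_CHUNKS:
        return [{"content": part} for part in text.split("\n\n") if part]
    return [{"content": text}]


source = "First paragraph.\n\nSecond paragraph."

# Relying on the new default yields a single, unchunked document.
print(len(convert(source)))

# Passing the export type explicitly restores chunked output.
print(len(convert(source, export_type=ExportType.DOC_CHUNKS)))
```

Setting `export_type` explicitly also keeps a pipeline independent of any later change to the default.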

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I have used one of the conventional commit types for my PR title: fix:

…ber to chunk metadata - ExportType.MARKDOWN is now the default (was DOC_CHUNKS), aligning with Haystack convention of separating conversion from chunking - MetaExtractor.extract_chunk_meta now extracts page_number from chunk provenance, making metadata consistent with other Haystack splitters

SyedShahmeerAli12 · 2026-05-06T07:39:31Z

Hey @bogdankostic, this resolves #3256. Happy to get feedback on this!

bogdankostic

Thank you @SyedShahmeerAli12! I added a comment about reverting the additions to the changelog as these are added automatically.

Also, I was wondering if we could add more metadata, as pointed out in the issue, such as split_id and split_idx_start.

bogdankostic · 2026-05-06T14:29:30Z

Please revert these changes. The changelog will be populated automatically when a new release is triggered.

SyedShahmeerAli12 · 2026-05-06T20:08:31Z

@bogdankostic Both points addressed:

CHANGELOG: reverted; removed the manually added [Unreleased] section.
split_id / split_idx_start: added both fields to chunk metadata in the DOC_CHUNKS branch of run() (alongside the existing page_number). split_id is the 0-based chunk index, and split_idx_start is the cumulative character offset based on chunk.text length. Both reset per source document, matching the behaviour of Haystack's DocumentSplitter. Tests updated; all 36 pass.
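
The bookkeeping described above can be sketched as follows. This is an illustrative reading of the comment, not the code from the PR: the `split_positions` helper and the sample file names are hypothetical, and chunks are represented by plain strings standing in for `chunk.text`.

```python
from typing import Iterable


def split_positions(chunk_texts: Iterable[str]) -> list[dict]:
    """Assign split_id and split_idx_start to the chunks of ONE source document."""
    metas = []
    offset = 0  # cumulative character offset, starting at 0 for each document
    for split_id, text in enumerate(chunk_texts):
        metas.append({"split_id": split_id, "split_idx_start": offset})
        offset += len(text)  # the next chunk starts after this chunk's characters
    return metas


# Calling the helper once per source document resets both counters.
docs = {"a.pdf": ["Intro.", "Body text."], "b.pdf": ["Only chunk."]}
all_meta = {name: split_positions(texts) for name, texts in docs.items()}
print(all_meta["a.pdf"])
print(all_meta["b.pdf"])
```

Because the offset is accumulated from chunk text lengths, it is relative to the concatenated chunk texts rather than to positions in the original file.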

bogdankostic

Thanks @SyedShahmeerAli12, looking good to me! :)

SyedShahmeerAli12 requested a review from a team as a code owner May 6, 2026 07:28

SyedShahmeerAli12 requested review from bogdankostic and removed request for a team May 6, 2026 07:28

github-actions bot added the integration:docling and type:documentation labels May 6, 2026

bogdankostic requested changes May 6, 2026

bogdankostic changed the title ~~fix(docling): change default export type to MARKDOWN and add page_number to chunk metadata~~ fix(docling)!: change default export type to MARKDOWN and add page_number to chunk metadata May 6, 2026

fix(docling): add split_id and split_idx_start to chunk metadata; rev…

ab4121c

…ert CHANGELOG

bogdankostic approved these changes May 11, 2026

bogdankostic merged commit 28ada67 into deepset-ai:main May 11, 2026
16 checks passed

bogdankostic merged 2 commits into deepset-ai:main from SyedShahmeerAli12:fix/docling-metadata-consistency
